9 Working with multiple datasets
It is often useful to bring separate datasets together in order to carry out a given analytic task. For instance, one might want to compare change over time by bringing together census datasets from different years (though it’s worth noting that temporal analysis with census data can be tricky, and must be undertaken with caution, due to changes in calculation methods or geographic borders across different census years).
For example, let’s say we want to visualize the percentage point change in every state’s rural population share between 2000 and 2010. In other words, how did the percentage of each states rural population (with respect to its overall population) change across the first decade of the 2000s? Which states saw the greatest degree of rural depopulation, and which states saw the lowest degree of rural depopulation?
Let’s begin with the population_rural_2000_2010_list object that we created in the previous section. It’s a list that contains information on the percentage of each state’s population living in rural areas; the first list element contains information on states’ rural population share from the 2000 census, while the second contains analogous information from the 2010 census.
# Prints contents of "population_rural_2000_2010_list"
population_rural_2000_2010_list## $`2000`
## # A tibble: 52 x 6
## GEOID NAME P001001 P002005 rural_pct year
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alabama 4447100 1981427 44.6 2000
## 2 02 Alaska 626932 215675 34.4 2000
## 3 04 Arizona 5130632 607097 11.8 2000
## 4 05 Arkansas 2673400 1269221 47.5 2000
## 5 06 California 33871648 1881985 5.56 2000
## 6 08 Colorado 4301261 668076 15.5 2000
## 7 09 Connecticut 3405565 417506 12.3 2000
## 8 10 Delaware 783600 155842 19.9 2000
## 9 11 District of Columbia 572059 0 0 2000
## 10 12 Florida 15982378 1712358 10.7 2000
## # … with 42 more rows
##
## $`2010`
## # A tibble: 52 x 6
## GEOID NAME P001001 P002005 rural_pct year
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alabama 4779736 1957932 41.0 2010
## 2 02 Alaska 710231 241338 34.0 2010
## 3 04 Arizona 6392017 651358 10.2 2010
## 4 05 Arkansas 2915918 1278329 43.8 2010
## 5 06 California 37253956 1880350 5.05 2010
## 6 22 Louisiana 4533372 1215567 26.8 2010
## 7 21 Kentucky 4339367 1806024 41.6 2010
## 8 08 Colorado 5029196 696435 13.8 2010
## 9 09 Connecticut 3574097 429155 12.0 2010
## 10 10 Delaware 897934 149985 16.7 2010
## # … with 42 more rows
In order to visualize the percentage point difference in state-level rural population shares between 2010 and 2000, we can join the two datasets in population_rural_2000_2010_list, compute the difference between “rural_pct” in the 2010 dataset and “rural_pct” in the 2000 dataset, and then use now-familiar visualization techniques to visualize this difference.
Before joining the two datasets, it might make sense to clean them up to avoid any confusion after we implement the join. First, it would be nice to rename the “rural_pct” variable in the two datasets to indicate the relevant year, i.e. in the form “rural_pct_YEAR” where YEAR takes on the value in the existing “year” column. After renaming the “rural_pct” variable in this way, we can go ahead and delete the “P001001”, “P002005”, and “year” columns, since they do not contain necessary information, and will simply clutter the final (joined) dataset.
9.1 Using functions and iteration to clean multiple datasets
To clean up the datasets in this way, let’s write a function that makes these changes, and use now-familiar iteration techniques to apply the changes to both of the list elements in population_rural_2000_2010_list.
# Write function to clean dataset according to the requirements specified above and assign to an object named "cleaning_function"
cleaning_function<-function(dataset){
year_vector<-dataset$year # extracts the "year" column as a vector
year_extracted<-year_vector[1] # plucks out the first element of "year_vector"
desired_name<-paste0("rural_pct", "_", year_extracted) # creates desired name string
names(dataset)[5]<-desired_name # renames "rural_pct" with "desired_name"
final<-dataset %>% select(-c(P001001, P002005, year)) # deletes specified columns
return(final) # returns cleaned dataset
}# applies "cleaning_function" to both of the list elements in ```population_rural_2000_2010_list``` and assigns the result to a new object named "population_rural_2000_2010_list_cleaned"
population_rural_2000_2010_list_cleaned<-map(population_rural_2000_2010_list, cleaning_function)# Prints contents of "population_rural_2000_2010_list_cleaned"
population_rural_2000_2010_list_cleaned## $`2000`
## # A tibble: 52 x 3
## GEOID NAME rural_pct_2000
## <chr> <chr> <dbl>
## 1 01 Alabama 44.6
## 2 02 Alaska 34.4
## 3 04 Arizona 11.8
## 4 05 Arkansas 47.5
## 5 06 California 5.56
## 6 08 Colorado 15.5
## 7 09 Connecticut 12.3
## 8 10 Delaware 19.9
## 9 11 District of Columbia 0
## 10 12 Florida 10.7
## # … with 42 more rows
##
## $`2010`
## # A tibble: 52 x 3
## GEOID NAME rural_pct_2010
## <chr> <chr> <dbl>
## 1 01 Alabama 41.0
## 2 02 Alaska 34.0
## 3 04 Arizona 10.2
## 4 05 Arkansas 43.8
## 5 06 California 5.05
## 6 22 Louisiana 26.8
## 7 21 Kentucky 41.6
## 8 08 Colorado 13.8
## 9 09 Connecticut 12.0
## 10 10 Delaware 16.7
## # … with 42 more rows
9.2 Joining datasets
Now, let’s merge these datasets together, using “GEOID” as the join variable:
# Joins datasets in list and assigns joined dataset to object named "rural_change"
rural_change<-full_join(population_rural_2000_2010_list_cleaned[["2010"]], population_rural_2000_2010_list_cleaned[["2000"]], by="GEOID")# cleans "rural_change" by deleting "NAME.y" and renaming "NAME.x", and then creates new variable that captures the difference between "rural_pct_2010" and "rural_pct_2000"
rural_change<-rural_change %>%
select(-(NAME.y)) %>%
rename(NAME=NAME.x) %>%
mutate(rural_share_change=rural_pct_2010-rural_pct_2000)# prints modified "rural_change" object
rural_change## # A tibble: 52 x 5
## GEOID NAME rural_pct_2010 rural_pct_2000 rural_share_change
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 01 Alabama 41.0 44.6 -3.59
## 2 02 Alaska 34.0 34.4 -0.421
## 3 04 Arizona 10.2 11.8 -1.64
## 4 05 Arkansas 43.8 47.5 -3.64
## 5 06 California 5.05 5.56 -0.509
## 6 22 Louisiana 26.8 27.4 -0.560
## 7 21 Kentucky 41.6 44.2 -2.62
## 8 08 Colorado 13.8 15.5 -1.68
## 9 09 Connecticut 12.0 12.3 -0.252
## 10 10 Delaware 16.7 19.9 -3.18
## # … with 42 more rows
We can now use this joined dataset to visualize the extent of rural depopulation from 2000 to 2010, across the various states:
rural_change_graph<-rural_change %>%
ggplot(aes(x = reorder(NAME,rural_share_change), y=rural_share_change)) +
geom_col()+
coord_flip()+
labs(title="Rural Depopulation", x="State Name", y="Pct Change in Rural Population")+
theme(plot.title=element_text(hjust=0.5))
rural_change_graph 
To make a map:
state_population_2010_geo<-get_decennial(geography = "state",
variables = "P001001",
year = 2010,
geometry=TRUE)
state_population_2010_geo## Simple feature collection with 52 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## geographic CRS: NAD83
## # A tibble: 52 x 5
## GEOID NAME variable value geometry
## <chr> <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 23 Maine P001001 1.33e6 (((-67.61976 44.51975, -67.61541 44.52197, -67.58774 44.5162, -67…
## 2 25 Massachuset… P001001 6.55e6 (((-70.83204 41.6065, -70.82373 41.59857, -70.82092 41.58767, -70…
## 3 26 Michigan P001001 9.88e6 (((-88.68443 48.11578, -88.67563 48.12044, -88.67639 48.12488, -8…
## 4 30 Montana P001001 9.89e5 (((-104.0577 44.99743, -104.2501 44.99822, -104.4701 44.99845, -1…
## 5 32 Nevada P001001 2.70e6 (((-114.0506 37.0004, -114.05 36.95777, -114.0506 36.84313, -114.…
## 6 34 New Jersey P001001 8.79e6 (((-75.52684 39.65571, -75.52634 39.65641, -75.52234 39.66081, -7…
## 7 36 New York P001001 1.94e7 (((-71.94356 41.28667, -71.9268 41.29012, -71.93526 41.28058, -71…
## 8 37 North Carol… P001001 9.54e6 (((-82.60288 36.03983, -82.60074 36.03742, -82.59878 36.03416, -8…
## 9 39 Ohio P001001 1.15e7 (((-82.81349 41.72347, -82.81049 41.72052, -82.80887 41.70833, -8…
## 10 42 Pennsylvania P001001 1.27e7 (((-75.41504 39.80179, -75.42804 39.80921, -75.45374 39.82031, -7…
## # … with 42 more rows
rural_change_geo<-full_join(state_population_2010_geo, rural_change, by="GEOID")
rural_change_geo## Simple feature collection with 52 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## geographic CRS: NAD83
## # A tibble: 52 x 9
## GEOID NAME.x variable value geometry NAME.y rural_pct_2010 rural_pct_2000
## <chr> <chr> <chr> <dbl> <MULTIPOLYGON [°]> <chr> <dbl> <dbl>
## 1 23 Maine P001001 1.33e6 (((-67.61976 44.51975, -… Maine 61.3 59.8
## 2 25 Massa… P001001 6.55e6 (((-70.83204 41.6065, -7… Massa… 8.03 8.63
## 3 26 Michi… P001001 9.88e6 (((-88.68443 48.11578, -… Michi… 25.4 25.3
## 4 30 Monta… P001001 9.89e5 (((-104.0577 44.99743, -… Monta… 44.1 45.9
## 5 32 Nevada P001001 2.70e6 (((-114.0506 37.0004, -1… Nevada 5.80 8.49
## 6 34 New J… P001001 8.79e6 (((-75.52684 39.65571, -… New J… 5.32 5.65
## 7 36 New Y… P001001 1.94e7 (((-71.94356 41.28667, -… New Y… 12.1 12.5
## 8 37 North… P001001 9.54e6 (((-82.60288 36.03983, -… North… 33.9 39.8
## 9 39 Ohio P001001 1.15e7 (((-82.81349 41.72347, -… Ohio 22.1 22.6
## 10 42 Penns… P001001 1.27e7 (((-75.41504 39.80179, -… Penns… 21.3 22.9
## # … with 42 more rows, and 1 more variable: rural_share_change <dbl>
preliminary_ruralchange_map<-tm_shape(rural_change_geo)+
tm_polygons(col="rural_share_change", n=6,style="jenks",palette="BuGn", midpoint=FALSE)
preliminary_ruralchange_mapstate_population_2010_geo<-get_decennial(geography = "state",
variables = "P001001",
year = 2010,
geometry=TRUE) %>%
tigris::shift_geometry()## Getting data from the 2010 decennial Census
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using Census Summary File 1
rural_change_geo<-full_join(state_population_2010_geo, rural_change, by="GEOID")final_ruralchange_map<-tm_shape(rural_change_geo)+
tm_polygons(col="rural_share_change",
breaks=c(-6,-4,-2, 0, 1, 2),
palette="YlGnBu",
midpoint=T,
title="Percentage Point\nDifference in Rural Population Share\n(2010 vs 2000)")+
tm_layout(frame=FALSE,
main.title="Rural Depopulation by State, 2000-2010",
main.title.position="center",
legend.outside=TRUE,
legend.title.size=1,
legend.outside.position = c("bottom"),
attr.outside=T)+
tm_credits("Source: US Census via tidycensus", position=c("right", "bottom"))
final_ruralchange_map## Credits not supported in view mode.